Abstract

Motivated by data-rich experiments in transcriptional regulation and sensory neu- 
roscience, we consider the following general problem in statistical inference. A 
system of interest, when exposed to a stimulus S, adopts a deterministic response 
R of which a noisy measurement M is made. Given a large number of mea- 
surements and corresponding stimuli, we wish to identify the correct "response 
function" relating R to S. However the "noise function" relating M to R is un- 
known a priori. Here we show that maximizing likelihood over both response 
functions and noise functions is equivalent to simply identifying maximally infor- 
mative response functions - ones that maximize the mutual information I[R; M] 
between predicted responses and corresponding measurements. Moreover, if the 
correct response function is in the class of models being explored, maximizing 
mutual information becomes equivalent to simultaneously maximizing every de- 
pendence measure that satisfies the Data Processing Inequality. We note that ex- 
periments of the type considered are unable to distinguish between parametrized 
response functions lying along certain "diffeomorphic modes" in parameter space. 
We show how to derive these diffeomorphic modes and observe, fortunately, that 
such modes typically span a very low-dimensional subspace. Therefore, given 
sufficient data, maximizing mutual information can pinpoint nearly all response 
function parameters without requiring any model of experimental noise. 



1 Introduction 

This paper discusses a familiar problem in statistical inference, but focuses on an under-studied limit 
which is becoming increasingly relevant in both neuroscience and molecular biology. Consider an 
experiment having the following form: 

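$$
\underset{\text{stimulus}}{S}
\;\xrightarrow[\;\theta(S)\;]{\text{response function}}\;
\underset{\text{response}}{R}
\;\xrightarrow[\;\pi(M|R)\;]{\text{noise function}}\;
\underset{\text{measurement}}{M} \tag{1}
$$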

When presented with a stimulus S, a system of interest adopts a deterministic response R, of which
a noisy measurement M is made. Specifically, stimuli are drawn from a probability distribution
p(S), the response R to each stimulus is determined by a "response function" θ, and a measurement
M is then generated with probability given by the "noise function" π(M|R). We refer to this as an
"SRM-type" experiment. From a large number of independent stimulus-measurement pairs (S_n, M_n),
n = 1, 2, ..., N, we wish to reconstruct θ.

This is a standard regression problem and is typically solved [3] by first assuming a specific noise
function π, then searching a space Θ of model response functions for the one θ ∈ Θ which maximizes
the likelihood

$$ p(\{M_n\}|\{S_n\},\theta,\pi) = e^{N\mathcal{L}(\theta,\pi)} \quad\text{where}\quad \mathcal{L}(\theta,\pi) = \frac{1}{N}\sum_{n=1}^{N}\log\pi(M_n|\theta(S_n)). \tag{2} $$



For instance, the method of least squares regression corresponds to assuming a Gaussian noise
function π. Often the assumed noise model is only approximate, or is adopted primarily for analytical
convenience. Nevertheless, an incorrect π can work reasonably well, allowing one to infer a tolerably
accurate response model when data are limited.
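
To make the least-squares correspondence explicit, the following worked step substitutes a Gaussian noise function into Eq. 2. The fixed noise width σ is an assumption introduced here for illustration and does not appear in the original text; inside the normalization, the symbol written as \pi_{\circ} is the numerical constant 3.14159..., not the noise function.

```latex
% Assumption: Gaussian noise of fixed width \sigma; \pi_{\circ} is the number 3.14159...
\begin{align*}
\pi(M|R) &= \frac{1}{\sqrt{2\pi_{\circ}\sigma^2}}
            \exp\!\left[-\frac{(M-R)^2}{2\sigma^2}\right] \\
\mathcal{L}(\theta,\pi) &= \frac{1}{N}\sum_{n=1}^{N}\log\pi(M_n|\theta(S_n))
  = -\frac{1}{2\sigma^2}\,\frac{1}{N}\sum_{n=1}^{N}\bigl(M_n-\theta(S_n)\bigr)^2
    -\frac{1}{2}\log\bigl(2\pi_{\circ}\sigma^2\bigr)
\end{align*}
```

For fixed σ the second term is constant, so maximizing the likelihood over θ is the same as minimizing the mean squared residual.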

However, certain experiments in sensory neuroscience and transcriptional regulation operate in the
data-saturated limit. In such cases, systematic error in θ caused by an incorrect noise function can
dominate over the uncertainty due to finite sampling. Such bias has been documented when the
receptive fields of sensory neurons are characterized using natural stimuli. In one study [24],
anesthetized cats were shown a series of woodland scenes, providing neurons in V1 cortex with stimuli
S of ~ 10^ - 10'^ pixels each. Measurements M ∈ {spike, no spike} were taken for individual V1
neurons until as many spikes as relevant pixels had been recorded, yielding N > N_spike ~ dim(S).
From these data the authors inferred a receptive field for each neuron, defined as a stimulus vector
θ such that the projection R = S · θ determined spiking probability. Inference using the standard
reverse-correlation spike-triggered average [21], corresponding to maximum likelihood with
π(spike|R) ∝ exp[R] and assuming Gaussian stimuli, was shown to strongly bias the inferred
receptive field.

Analogous experiments probing the fine structure of the transcriptional regulatory code are now
possible [9, 16, 10, 14, 17, 22, 13], thanks to the development of ultra-high-throughput DNA
sequencing technologies. To characterize how a specific transcriptional regulatory sequence (TRS)
functions, a large number (~ 10^4 - 10^) of variants S of the TRS are used to control the expression
of a gene, and a measurement M of the transcription rate R resulting from each variant is made.
Modeling the quantitative dependence of R on S can then be used to characterize the
sequence-dependent energy with which each regulatory protein binds the TRS, as well as to measure
the interaction energies between bound factors [10]. In this case, the possibility that systematic
error from the inference procedure distorts biochemical measurements is a serious concern.

An alternative inference procedure that is free of systematic bias is to maximize the mutual
information [5] between predictions R and measurements M,¹

$$ I(\theta) = \int dR\,dM\; p(R,M)\log\frac{p(R,M)}{p(R)\,p(M)}. \tag{3} $$

Here, p(R, M) is the empirical joint distribution of predictions and measurements, and thus
depends implicitly on θ.² This method has been proposed and applied in the specific contexts of both
receptive field inference [23, 24, 15, 18] and transcriptional regulation [8, 7, 9, 10, 14]. However, a
general discussion of how maximizing mutual information relates to maximizing likelihood has yet
to be presented.

Here we study the general problem of identifying optimal response models in the N → ∞ limit
when the noise function π is unknown a priori. We show that maximizing I(θ) is equivalent to
maximizing ℒ(θ, π) over both θ and π, and further becomes equivalent to simultaneously maximizing
every dependence measure which satisfies the Data Processing Inequality (DPI) [5] when some
candidate θ fully explains the data. Tests for whether or not an inferred θ fully explains the data are
also described. We then address the issue that SRM-type experiments cannot distinguish between θ
within certain equivalence classes. This leads to "diffeomorphic modes" in parameter space which
cannot be pinned down by data. An equation for diffeomorphic modes is presented, and is used to
derive all the diffeomorphic modes of general linear models and a specific linear-nonlinear model
that has been studied previously [10].

Throughout this manuscript, R is specifically used to represent predictions of the model θ, i.e.
R = θ(S) for an implicit stimulus S. Similarly R* = θ*(S), R_1 = θ_1(S), etc. Responses R
are assumed to be multidimensional with components {R^μ}, and ∂_μ = ∂/∂R^μ. θ denotes both
a response model and the parameters of that model. Θ is used to represent both an abstract space
of response models, as well as the space of parameters for models θ assumed to have a specific
functional form. In the latter case, {θ^i} denotes coordinates in parameter space, and ∂_i = ∂/∂θ^i.
Implicit summation over repeated indices i or μ is assumed.

¹ The notation I(θ) and I[R; M] will be used interchangeably.

² For I(θ) to work as an objective function, one typically expects that N ≫ dim(θ) will be required
for reliable estimation of p(R, M) under all choices of θ. A rapid and accurate method for estimating
the density p(R, M) from finite data is also needed.

2 Mutual information and likelihood 

In the N → ∞ limit, the per-datum log likelihood of the pair (θ, π) can be decomposed as follows,

$$ \mathcal{L}(\theta,\pi) = \int dR\,dM\; p(R,M)\log\pi(M|R) = I(\theta) - D(\theta,\pi) - H[M]. \tag{4} $$

The first term on the right is the mutual information (Eq. 3), which is independent of π. The second
term,

$$ D(\theta,\pi) = \int dR\,dM\; p(R,M)\log\frac{p(M|R)}{\pi(M|R)}, \tag{5} $$

is the Kullback-Leibler (KL) divergence between the empirical distribution p(M|R) observed for
the response model θ and the assumed noise function π(M|R), and thus depends on both θ and
π. The last term, H[M] = −∫ dM p(M) log p(M), is the entropy of the measurements M; it is
independent of both θ and π, and can thus be ignored in the optimization problem.

In the N → ∞ limit, the problem of finding pairs (θ, π) which maximize ℒ(θ, π) is identical to
the problem of only finding response functions θ which maximize I(θ). This follows from the fact
that, for a given choice of θ, choosing π to match the empirical noise model, π(M|R) = p(M|R),
globally minimizes D(θ, π) and causes it to vanish. The maximum likelihood problem therefore
reduces to the problem of finding θ which maximize max_π ℒ(θ, π) = I(θ) − H[M]; this is identical
to the problem of maximizing I(θ).
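
The decomposition in Eq. 4 can be checked numerically on a small discrete example. The sketch below is illustrative only: the joint table `joint` and the assumed noise table `assumed_noise` are hypothetical values chosen for this example, and sums over discrete outcomes stand in for the integrals.

```python
import math

def marginals(joint):
    """Return (p(R), p(M)) for a joint table joint[r][m] that sums to 1."""
    p_r = [sum(row) for row in joint]
    p_m = [sum(col) for col in zip(*joint)]
    return p_r, p_m

def mutual_information(joint):
    """I[R;M] in nats (discrete analogue of Eq. 3)."""
    p_r, p_m = marginals(joint)
    return sum(p * math.log(p / (p_r[r] * p_m[m]))
               for r, row in enumerate(joint)
               for m, p in enumerate(row) if p > 0)

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def log_likelihood(joint, noise):
    """Per-datum log likelihood: sum over (r, m) of p(r, m) log pi(m|r)."""
    return sum(p * math.log(noise[r][m])
               for r, row in enumerate(joint)
               for m, p in enumerate(row) if p > 0)

def kl_divergence(joint, noise):
    """D(theta, pi): divergence of the empirical p(m|r) from pi(m|r) (Eq. 5)."""
    p_r, _ = marginals(joint)
    return sum(p * math.log((p / p_r[r]) / noise[r][m])
               for r, row in enumerate(joint)
               for m, p in enumerate(row) if p > 0)

# Hypothetical empirical joint p(R, M): 3 predicted responses, 2 outcomes.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]
# Hypothetical assumed noise function pi(M|R); each row sums to 1.
assumed_noise = [[0.6, 0.4],
                 [0.3, 0.7],
                 [0.5, 0.5]]
# Noise function matched to the data, pi(M|R) = p(M|R).
empirical_noise = [[p / sum(row) for p in row] for row in joint]

info = mutual_information(joint)
h_m = entropy(marginals(joint)[1])
lik_assumed = log_likelihood(joint, assumed_noise)
lik_matched = log_likelihood(joint, empirical_noise)

# Both sides of Eq. 4 for the assumed noise function, then for the matched one.
print(lik_assumed, info - kl_divergence(joint, assumed_noise) - h_m)
print(lik_matched, info - h_m)
```

Each printed pair should agree to floating-point precision, and the matched noise function should give the larger likelihood, which is the sense in which maximizing over π leaves only I(θ) − H[M].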

It has been noted that when N is large and one's prior knowledge about π can be formalized with a
prior p(π), then the per-datum log of the marginal likelihood, ℒ(θ), is essentially equal to the mutual
information I(θ) [8, 9]. This can be seen by computing the marginal likelihood,

$$ p(\{M_n\}|\{S_n\},\theta) = \int d\pi\; p(\pi)\, p(\{M_n\}|\{S_n\},\theta,\pi) = \exp\bigl\{N\bigl[I(\theta) - \Lambda(\theta) - H[M]\bigr]\bigr\}, \tag{6} $$

where

$$ \Lambda(\theta) = -\frac{1}{N}\log\int d\pi\; p(\pi)\, e^{-N D(\theta,\pi)} \tag{7} $$

is the only term affected by the prior p(π). Under weak assumptions about p(π), Λ → 0 as
N → ∞.³ Therefore,

$$ \mathcal{L}(\theta) = \frac{1}{N}\log p(\{M_n\}|\{S_n\},\theta) = I(\theta) - \Lambda(\theta) - H[M], \tag{8} $$

is equal to I(θ) up to a constant and a θ-dependent correction which vanishes as N → ∞.



3 DPI-optimal response models 

In practice, one typically searches for optimal θ within a limited class Θ of possible response models.
In this section, we present results which obtain when any model θ ∈ Θ fully explains the data, i.e.
I(θ) = I(θ*) where θ* is the true response function.

³ In certain cases Λ(θ) can be computed explicitly and thus be shown to vanish [8]. More generally,
when π is taken to be finite-dimensional, a saddle-point computation (valid for large N) gives
Λ(θ) ≈ (1/2N) Tr[log ∂∂D̃] + const. Here, ∂∂D̃ is the π-space Hessian of
D̃(θ, π) = D(θ, π) − N^{-1} log p(π) computed at π(M|R) = p(M|R). If log p(π) and its derivatives
are bounded, then the θ-dependent part of Λ(θ) decays as N^{-1}. If π is infinite dimensional, this
saddle-point computation becomes a problem in field theory akin to the inference problem studied
by [1]. If this field theory is properly formulated through an appropriate choice of p(π), Λ(θ) can be
expected to exhibit different decay behavior, but still vanish as N → ∞. See also [9].
First we observe that θ = θ* globally maximizes I(θ) over all possible response functions. Given
any hypothesized θ together with θ*, and letting π* denote the true noise function, the chain of
stochastic variables

$$ R \xleftarrow{\;\theta(S)\;} S \xrightarrow{\;\theta^*(S)\;} R^* \xrightarrow{\;\pi^*(M|R^*)\;} M \tag{9} $$

forms a Markov chain [5], i.e. p(R, S, R*, M) = p(R|S) p(S) p(R*|S) p(M|R*). The fact that
mutual information satisfies DPI allows us to read off the inequality

$$ I[R;M] \le I[S;M] = I[R^*;M], \tag{10} $$

proving that θ = θ* globally maximizes I[R; M] = I(θ).

As has been noted [15], the same argument can be made not just for mutual information, but for
any dependence measure 𝒟[R; M] which satisfies DPI: the simple fact that Eq. 9 is a Markov chain
implies 𝒟[R; M] ≤ 𝒟[R*; M], proving θ = θ* globally maximizes 𝒟(θ) = 𝒟[R; M].⁴ Letting
Θ_𝒟 ⊆ Θ denote the set of all θ ∈ Θ which maximize a dependence measure 𝒟(θ), we see that
if θ* ∈ Θ, then θ* ∈ Θ_𝒟 for every 𝒟 satisfying DPI. So in fact θ* must be contained within the
infinite intersection of all such Θ_𝒟, which we shall denote by Θ_DPI:

$$ \theta^* \in \Theta_{\mathrm{DPI}} \equiv \bigcap_{\mathcal{D}\ \text{satisfying DPI}} \Theta_{\mathcal{D}}. \tag{12} $$



We now prove that when any θ ∈ Θ_I achieves I(θ) = I(θ*), the set Θ_DPI of such "DPI-optimal"
response models is in fact identical to the set Θ_I of maximally informative models, i.e.

$$ \Theta_I = \Theta_{\mathrm{DPI}}. \tag{13} $$

First, since mutual information satisfies DPI, Θ_DPI ⊆ Θ_I. Next, the fact (from Eq. 9) that
R ↔ R* ↔ M is a Markov chain means R contains no information about M which is not conveyed by
R*, and so I[R*; M] = I[R*, R; M]. This gives

$$ I[R^*;M] - I[R;M] = I[R^*,R;M] - I[R;M] = I[R^*;M|R]. \tag{14} $$

This conditional mutual information I[R*; M|R] must be the same for all θ ∈ Θ_I. If we further
assume I(θ) = I(θ*) for some (and thus all) θ ∈ Θ_I, then I[R*; M|R] = 0. This implies
p(R*, M|R) = p(R*|R) p(M|R), or equivalently, p(M|R*, R) = p(M|R). Therefore,
R* ↔ R ↔ M is also a Markov chain. Reconciling this with the fact that R ↔ R* ↔ M is
a Markov chain as well, DPI applies in both directions and we get 𝒟[R; M] = 𝒟[R*; M] for any
DPI-satisfying 𝒟. All such 𝒟 are therefore maximized by θ, meaning Θ_I ⊆ Θ_DPI. This completes
the proof.

We pause to offer some intuition for these results. For any hypothesized θ, the resulting joint
distribution p(R, M) will be a convolution of the true joint distribution p(R*, M) with the conditional
distribution p(R|R*),

$$ p(R,M) = \int dR^*\; p(R|R^*)\, p(R^*,M). \tag{15} $$

In general the reverse is not true, i.e. p(R*, M) ≠ ∫ dR p(R*|R) p(R, M), because p(R*|R) ≠
p(R*|R, M). This reflects a basic asymmetry among joint distributions p(R, M): sometimes one
can be derived from another by convolution, sometimes not.
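
The convolution in Eq. 15 and its effect on mutual information can be illustrated with discrete variables. This is a minimal sketch under stated assumptions: the table `joint_true` and the three kernels p(R|R*) are hypothetical, chosen to show a deterministic merge, a stochastic blur, and an invertible relabeling.

```python
import math

def mutual_information(joint):
    """Mutual information in nats for a joint table joint[a][b] summing to 1."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (p_a[a] * p_b[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)

def convolve(joint_true, kernel):
    """Discrete Eq. 15: p(r, m) = sum over r* of p(r | r*) p(r*, m).

    joint_true[r_star][m] is p(R*, M); kernel[r_star][r] is p(R | R*),
    so every row of kernel sums to 1.
    """
    n_star = len(joint_true)
    n_r = len(kernel[0])
    n_m = len(joint_true[0])
    return [[sum(kernel[s][r] * joint_true[s][m] for s in range(n_star))
             for m in range(n_m)]
            for r in range(n_r)]

# Hypothetical true joint p(R*, M): 3 true responses, 2 measurement outcomes.
joint_true = [[0.30, 0.10],
              [0.05, 0.25],
              [0.10, 0.20]]

# Deterministic kernel that merges the last two true responses.
merge_kernel = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Stochastic kernel that blurs neighbouring responses together.
blur_kernel = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
# Invertible relabeling (a permutation) of the true responses.
relabel_kernel = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

info_true = mutual_information(joint_true)
info_merged = mutual_information(convolve(joint_true, merge_kernel))
info_blurred = mutual_information(convolve(joint_true, blur_kernel))
info_relabeled = mutual_information(convolve(joint_true, relabel_kernel))

print(info_true, info_merged, info_blurred, info_relabeled)
```

Merging and blurring should not raise the information above `info_true`, while the invertible relabeling should leave it unchanged.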

All DPI-satisfying measures 𝒟[R; M] are either decreased or left unchanged by such convolutions.
Every such 𝒟 therefore imposes a weak ordering on the space of joint distributions. When neither of
two distributions p(R, M) and p(R', M) can be expressed as a convolution of the other, then different
DPI-satisfying measures 𝒟 can potentially rank these distributions differently. However, when
p(R, M) can be obtained from p(R', M) by convolution, then 𝒟[R; M] ≤ 𝒟[R'; M] is guaranteed.

Thus, because all p(R, M) under consideration derive from a single p(R*, M) by a convolution
of the form in Eq. 15, every DPI-satisfying measure 𝒟 ranks p(R*, M) no lower than any other
p(R, M). So if θ* ∈ Θ, one gets θ* ∈ Θ_DPI.

⁴ We note that there are an infinite number of dependence measures other than mutual information which
satisfy DPI. For instance, information measures of the f-divergence form [11],

$$ I_f[M;R] = \int dR\,dM\; p(R)\,p(M)\, f\!\left(\frac{p(R,M)}{p(R)\,p(M)}\right), \tag{11} $$

satisfy DPI when the function f(x) is convex for x > 0. Mutual information corresponds to f(x) = x log x,
while f(x) = (α − 1)^{-1} x^α for α > 0 is the more general Rényi divergence [20, 11].

The equivalence Θ_I = Θ_DPI, realized when I(θ) = I(θ*) for θ ∈ Θ_I, stems from the fact
that mutual information is maximally sensitive to such convolutions: if 𝒟[R; M] < 𝒟[R*; M] for
any measure 𝒟 satisfying DPI, then I[R; M] < I[R*; M]. Mutual information is not unusual in
this regard. For example, every information measure I_f[M; R] for which f(x) is strictly convex
satisfies Θ_{I_f} = Θ_DPI. There are, however, some dependence measures which are less sensitive
than mutual information: the trivial dependence measure 𝒟 = 0 satisfies DPI, but reveals nothing
about p(R, M).
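
The f-divergence family of Eq. 11 can be sketched for discrete variables as follows. The tables `joint_true` and `joint_merged` are hypothetical; the second is the first with its last two response values merged, which is one instance of the convolution in Eq. 15. The function names are illustrative and not taken from the original work.

```python
import math

def f_information(joint, f):
    """Discrete Eq. 11: sum over (r, m) of p(r) p(m) f(p(r, m) / (p(r) p(m)))."""
    p_r = [sum(row) for row in joint]
    p_m = [sum(col) for col in zip(*joint)]
    return sum(p_r[r] * p_m[m] * f(p / (p_r[r] * p_m[m]))
               for r, row in enumerate(joint)
               for m, p in enumerate(row))

def x_log_x(x):
    """Convex f that recovers mutual information (with 0 log 0 taken as 0)."""
    return x * math.log(x) if x > 0 else 0.0

def power_alpha_2(x):
    """f(x) = (alpha - 1)^(-1) x^alpha evaluated at alpha = 2."""
    return x ** 2

def trivial(x):
    """Trivial measure: satisfies DPI but is blind to the joint distribution."""
    return 0.0

# Hypothetical true joint p(R*, M) and the joint after merging two responses.
joint_true = [[0.30, 0.10],
              [0.05, 0.25],
              [0.10, 0.20]]
joint_merged = [[0.30, 0.10],
                [0.15, 0.45]]

for name, f in [("x log x", x_log_x), ("x^2", power_alpha_2), ("trivial", trivial)]:
    print(name, f_information(joint_true, f), f_information(joint_merged, f))
```

Both convex choices of f should report a loss under merging, whereas the trivial measure returns zero for either table. Note that with f(x) = x² an independent joint distribution gives the value 1 rather than 0, since f(1) = 1.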

4 Are the data fully explained? 

We now discuss how to check whether a given θ ∈ Θ_I fully explains the data, i.e. whether I(θ) =
I(θ*). Verifying that a maximally informative model is also fully informative is an important part
of the modeling process. Showing I(θ) < I(θ*) for any θ ∈ Θ_I will prove that the available
data require a different (or enlarged) space Θ of response models. On the other hand, showing
I(θ) = I(θ*) means no further information about θ can be obtained from the data in hand.

One method [24] is to directly measure the total stimulus-dependent information I[S; M] in the
measurements. From Eq. 10 this is seen to equal I(θ*). To do this, we rewrite the formula for I[S; M]
as

$$ I[S;M] = \int dS\,dM\; p(S,M)\log\frac{p(S,M)}{p(S)\,p(M)} = H[M] - \langle H_S[M]\rangle_S, \tag{16} $$

where the expectation value ⟨·⟩_S is taken over stimuli S drawn from p(S), and

$$ H_S[M] = -\int dM\; p(M|S)\log p(M|S) \tag{17} $$

is the measurement entropy for a particular stimulus S. If one has many measurements for a given
stimulus S, the entropy H_S[M] can be estimated. If such measurements are available for a
representative sample of stimuli S, the expectation value ⟨H_S[M]⟩_S in Eq. 16 can also be estimated.
This approach has been applied to experiments in both sensory neuroscience [24] and transcriptional
regulation [9]. In practice, however, experiments must be appropriately designed in order to provide
the measurements needed to estimate H_S[M] for a large, representative sample of stimuli.
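
A rough sketch of the estimate in Eqs. 16 and 17 for discrete measurements with replicates is given below. It uses a naive plug-in entropy estimator, which is an assumption of this example rather than the procedure of the cited studies; plug-in estimates are known to be biased when replicates are few, so this is meant only to show the bookkeeping. The data dictionaries are hypothetical.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Plug-in entropy estimate in nats from a Counter of outcomes."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def stimulus_information(replicates):
    """Estimate I[S;M] = H[M] - <H_S[M]>_S (Eq. 16).

    replicates maps each stimulus to a list of repeated measurements.
    Each stimulus is weighted by its share of all measurements, which
    stands in for p(S) when replicates are collected in proportion to it.
    """
    pooled = Counter()
    total = 0
    weighted_noise_entropy = 0.0
    for measurements in replicates.values():
        counts = Counter(measurements)
        pooled.update(counts)
        total += len(measurements)
        weighted_noise_entropy += len(measurements) * entropy_from_counts(counts)
    return entropy_from_counts(pooled) - weighted_noise_entropy / total

# Hypothetical replicate data: measurements fully determined by the stimulus.
deterministic = {"s1": ["spike"] * 4, "s2": ["none"] * 4}
# Hypothetical replicate data: measurements unrelated to the stimulus.
uninformative = {"s1": ["spike", "none"] * 2, "s2": ["spike", "none"] * 2}

print(stimulus_information(deterministic))
print(stimulus_information(uninformative))
```

In the first case all measurement entropy is explained by the stimulus, so the estimate equals H[M] = log 2; in the second case none is, so the estimate is zero.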

We therefore propose a second test which does not require modifications to the experiment.
Repeating the argument of Eq. 14 with S in place of R*, one sees that I(θ) = I(θ*) implies S ↔ R ↔ M
is a Markov chain, i.e. p(S|R, M) = p(S|R). Because of this, any function f(S) will satisfy

$$ \langle f\rangle_{S|R,M} = \langle f\rangle_{S|R} \tag{18} $$

for all R and M.⁵ The converse is true as well: if Eq. 18 is satisfied for all functions f(S), then
I(θ) = I(θ*). This can be seen by considering f(S) = δ(R* − θ*(S)), in which case Eq. 18 gives
p(R*|R, M) = p(R*|R). If this holds for all R*, then R* ↔ R ↔ M is a Markov chain, and so
I[R; M] = I[R*; M].

Therefore, if any function f(S) can be found which violates Eq. 18 for any θ ∈ Θ_I, then θ* ∉ Θ.
A downside of this test is its open-ended nature: one must try different functions f(S), of which
there are an infinite number. We suggest that, as a practical matter, choosing f(S) = θ'(S) for other
θ' ∈ Θ encountered in the process of searching for Θ_I might make sense. Alternatively, setting f
equal to the components of the gradient ∂_i R^μ seems sensible, since Eq. 18 applied to f = ∂_i R^μ
causes ∂_i I(θ) to vanish at θ = θ* [23, 9].
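
The conditional-average test of Eq. 18 can be sketched for discrete data as follows. The records, the three candidate models, and the test function `parity` are all hypothetical and chosen so the averages can be checked by hand; with real data, any observed difference would also need to be judged against sampling fluctuations.

```python
from collections import defaultdict

def conditional_averages(records, model, f):
    """Return (<f>_{S|R,M}, <f>_{S|R}) as dicts keyed by (r, m) and by r.

    records is a list of (stimulus, measurement) pairs, model maps a
    stimulus to a predicted response R, and f is the test function f(S).
    """
    sum_rm, n_rm = defaultdict(float), defaultdict(int)
    sum_r, n_r = defaultdict(float), defaultdict(int)
    for s, m in records:
        r = model(s)
        sum_rm[(r, m)] += f(s)
        n_rm[(r, m)] += 1
        sum_r[r] += f(s)
        n_r[r] += 1
    avg_rm = {key: sum_rm[key] / n_rm[key] for key in n_rm}
    avg_r = {key: sum_r[key] / n_r[key] for key in n_r}
    return avg_rm, avg_r

def max_violation(records, model, f):
    """Largest |<f>_{S|R,M} - <f>_{S|R}| over the observed (R, M) pairs."""
    avg_rm, avg_r = conditional_averages(records, model, f)
    return max(abs(avg_rm[(r, m)] - avg_r[r]) for (r, m) in avg_rm)

# Hypothetical data: four stimuli, binary measurement, and the number of
# times M = 1 was seen in 10 presentations of each stimulus.
ones_out_of_ten = {0: 1, 1: 5, 2: 5, 3: 9}
records = [(s, 1) for s, k in ones_out_of_ten.items() for _ in range(k)]
records += [(s, 0) for s, k in ones_out_of_ten.items() for _ in range(10 - k)]

def full_model(s):
    """Keeps every stimulus distinct."""
    return s

def sufficient_model(s):
    """Merges stimuli 1 and 2, which have identical measurement statistics."""
    return {0: 0, 1: 1, 2: 1, 3: 2}[s]

def coarse_model(s):
    """Merges stimuli {0, 1} and {2, 3}, which differ in their statistics."""
    return s // 2

def parity(s):
    """Test function f(S)."""
    return s % 2

for model in (full_model, sufficient_model, coarse_model):
    print(model.__name__, max_violation(records, model, parity))
```

Only the coarse model should show a nonzero violation, since it discards a distinction between stimuli that the measurements are sensitive to.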

5 Information equivalence and diffeomorphic modes 

Certain response models cannot be distinguished from one another by any SRM-type experiment
because their predictions are always equally informative about measurements. We say that two such
models θ_1 and θ_2 are "information equivalent", and write θ_1 ~ θ_2, since this insensitivity of
SRM-type experiments leads to a natural equivalence relation among response models. While Θ_I can
sometimes be influenced by a specific experiment's stimulus distribution p(S) and noise function
π*(M|R) (an issue we will not pursue here), information equivalence places hard constraints on the
structure of Θ_I, implying that certain equivalence classes within Θ must either be fully contained
within Θ_I or fully excluded.

⁵ ⟨·⟩_{S|R,M} denotes averaging with respect to p(S|R, M); ⟨·⟩_{S|R} corresponds to averaging over p(S|R).

We now prove that θ_1 ~ θ_2 if and only if the predictions of θ_1 and θ_2 are isomorphic, i.e. there
exists an invertible function f such that θ_1(S) = f(θ_2(S)) and θ_2(S) = f^{-1}(θ_1(S)) for all possible
stimuli S. First, such an isomorphism implies p(M|R_1) = p(M|f(R_2)) = p(M|R_2), which
means I[R_1; M] = I[R_2; M], and thus θ_1 ~ θ_2. Going the other direction, we can imagine an
SRM-type experiment in which θ* = θ_1 and p(M|R*) = δ(M − R*). Earlier we showed that,
if I[R_2; M] = I[R*; M], then R* ↔ R_2 ↔ M is a Markov chain. With our choice of response
function and noise function, R_1 ↔ R_2 ↔ R_1 is thus a Markov chain, implying R_1 must be a
deterministic function of R_2. Imagining the same experiment with θ* = θ_2 instead, we see that this
function must be invertible.

If all θ ∈ Θ have a specific parametric form, information equivalence implies that moving a
response model θ along certain directions in parameter space may not change I(θ). Consider what
happens when θ, having parameters {θ^i}, is infinitesimally transported along a vector field g^i(θ),
yielding a new model θ' with components θ'^i = θ^i + ε g^i(θ). Each prediction R, having components
{R^μ} in response space, will thus be transformed to R'^μ = R^μ + ε g^i(θ) ∂_i R^μ. If θ' ~ θ, the change
R'^μ − R^μ = ε g^i(θ) ∂_i R^μ must, for all stimuli S, be fully specified by the value of predictions R
and parameters θ and not otherwise depend on S. There must therefore be a vector field h^μ(R, θ) in
response space satisfying

$$ g^i(\theta)\,\partial_i R^\mu = h^\mu(R,\theta). \tag{19} $$

We refer to vector fields g^i(θ) which satisfy this equation as "diffeomorphic modes". Movement of
any model θ along its corresponding vector g^i(θ) induces a diffeomorphism of responses, predicted
for all possible stimuli, defined by flows of the vector field h^μ(R, θ).⁶

Importantly, diffeomorphic modes correspond to continuous changes of model parameters which
cannot, in principle, be constrained by SRM-type data. The parametric form assumed for all θ ∈ Θ
determines which diffeomorphic modes exist, and identifying these modes analytically is critical
when analyzing real data. For instance, if one is able to sample Θ_I, e.g. using Monte Carlo
techniques, then the position of θ ∈ Θ_I along diffeomorphic modes may have to be artificially fixed
in order to arrive at values for individual model parameters. We therefore turn to the problem of
computing diffeomorphic modes for models of different functional form.

5.1 General linear response models 

Linear response models are of particular interest. In neuroscience they are commonly used to
represent neuron receptive fields, and the resulting challenge of identifying "maximally informative
dimensions" in stimulus space has received focused attention [23, 24]. In transcriptional regulation,
linear "energy matrix" models are often used to represent the sequence-dependent binding energies
of transcription factors, and the problem of inferring these from microarray data [8, 7] and DNA
sequence data [9, 10, 14] has also been studied.

Here we derive the diffeomorphic modes of arbitrary linear response models. Assume

$$ R^\mu = \theta^i F_i^\mu(S) \tag{21} $$

for some set of stimulus features F_i^μ(S). Note that these models are linear in their parameters but
not necessarily linear in the stimulus S. To find the diffeomorphic modes, we apply Eq. 19 to Eq. 21,
giving g^i(θ) F_i^μ(S) = h^μ(θ^i F_i(S), θ). The left hand side is linear in stimulus features, and so
h^μ(R, θ) must⁷ also be a linear function of R, i.e. have the highly restricted form

$$ h^\mu(R,\theta) = a^\mu(\theta) + b^\mu_{\ \nu}(\theta)\, R^\nu. \tag{22} $$

Thus, the number of diffeomorphic modes of a general linear model, given by the number of
parameters on which h^μ depends at each θ, is bounded above by dim(R)[dim(R) + 1]. Importantly,
this bound is independent of the number of stimulus features (i.e. dim(S)); it depends only on the
dimension of response space. In particular, if R is a scalar, then there are at most 2 diffeomorphic
modes, corresponding to additive and multiplicative transformations of R.

⁶ Alternatively, one can define diffeomorphic modes in terms of the generator equation [9],

$$ g^i(\theta)\,\partial_i = h^\mu(R,\theta)\,\partial_\mu. \tag{20} $$

We have found that working with this formulation eases notation and aids interpretability when deriving the
diffeomorphic modes of a specific parametric model, but we will use Eq. 19 in what follows for the sake of
concreteness.
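
As a concrete check of Eqs. 19 and 22, consider a scalar response that is linear in its parameters and includes a constant feature. The constant feature F_0(S) = 1 is an assumption made here for illustration; without it (or an equivalent combination of features), the additive mode would be absent.

```latex
% Assumption: scalar response with a constant feature F_0(S) = 1
\begin{align*}
R &= \theta^0 + \sum_{i \ge 1} \theta^i F_i(S) \\[4pt]
\text{additive mode, } g^i = \delta^{i}_{\,0}: \quad
g^i \partial_i R &= \partial_0 R = 1
  \;\Rightarrow\; h(R,\theta) = 1 \quad (a = 1,\ b = 0), \qquad R \to R + \epsilon \\[4pt]
\text{multiplicative mode, } g^i = \theta^i: \quad
g^i \partial_i R &= \theta^0 + \sum_{i \ge 1} \theta^i F_i(S) = R
  \;\Rightarrow\; h(R,\theta) = R \quad (a = 0,\ b = 1), \qquad R \to (1+\epsilon)\, R
\end{align*}
```

In both cases the change in R depends on S only through R itself, so neither direction in parameter space can be constrained by SRM-type data.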



5.2 A linear-nonlinear response model 

We now show that combining multiple linear response models into a single linear-nonlinear model
can eliminate diffeomorphic modes. This fact proved useful in a recent study by Kinney et al. [10].
In the context of their work, each stimulus S was a mutated version of a 75 base pair region of the
Escherichia coli lac promoter DNA. A linear response function P was used to model the binding
energy of RNA polymerase to its site on this promoter, while a separate linear function Q was used
to model the interaction of the transcription factor CRP with its promoter binding site. The resulting
rate of mRNA transcription was represented by the "regulation factor" R [2], which is related to the
equilibrium occupancy of RNA polymerase at its binding site (occupancy = [1 + R^{-1}]^{-1}), and
thus to the rate of mRNA transcription. In terms of P and Q, the regulation factor R was given by

$$ R = e^{-P}\,\frac{1 + e^{-Q-\gamma}}{1 + e^{-Q}}, \tag{23} $$

where γ is the interaction energy between CRP and RNA polymerase. Note that, in this equation,
the energies P, Q, and γ are all in units of k_B T.

We now derive the diffeomorphic modes of R. Since P and Q depend on sequence features that can
be varied independently, any diffeomorphic mode of R has to be a diffeomorphic mode of both P
and Q. Since P and Q are linear in their parameters, and the only other parameter in the model is γ,
any diffeomorphic mode of R must have the form

$$ g^i(\theta)\,\partial_i R = h(R,\theta) = (a_P + b_P P)\,\partial_P R + (a_Q + b_Q Q)\,\partial_Q R + a_\gamma\,\partial_\gamma R. \tag{24} $$

Again, the coefficients a_P, b_P, a_Q, b_Q, a_γ can be arbitrary functions of any of the model parameters,
but cannot depend on S. Computing the derivatives and then substituting for P in terms of Q and
R, we find

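$$ h(R,\theta) = -R\left\{ a_P - b_P\log\left[\frac{R\,(1+e^{-Q})}{1+e^{-Q-\gamma}}\right] - \frac{(a_Q + b_Q Q)\,e^{-Q}\,(1-e^{-\gamma})}{(1+e^{-Q-\gamma})(1+e^{-Q})} + \frac{a_\gamma\, e^{-Q-\gamma}}{1+e^{-Q-\gamma}} \right\}. \tag{25} $$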

For g^i(θ) to be a diffeomorphic mode, the right hand side must be independent of S for fixed R. But
Q depends on S, so we must have b_P = a_Q = b_Q = a_γ = 0.⁸ Diffeomorphic modes of R are thus
defined by only one parameter, a_P, and satisfy

$$ g^i(\theta)\,\partial_i R = -a_P R, \tag{26} $$

corresponding to an additive shift of P.
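
The contrast between shifting P and shifting Q can be checked numerically. The sketch below assumes the regulation factor has the form R = exp(−P) (1 + exp(−Q − γ)) / (1 + exp(−Q)), which is consistent with Eq. 26 (∂R/∂P = −R); the numerical values of Q, γ, the target R, and the shift are hypothetical. Two sequences are constructed to have the same R but different Q, so that any S-dependence of the transformed response becomes visible.

```python
import math

def regulation_factor(p, q, gamma):
    """Assumed form: R = exp(-P) (1 + exp(-Q - gamma)) / (1 + exp(-Q))."""
    return math.exp(-p) * (1 + math.exp(-q - gamma)) / (1 + math.exp(-q))

def p_for_target(r_target, q, gamma):
    """Choose P so that regulation_factor(P, q, gamma) equals r_target."""
    return math.log((1 + math.exp(-q - gamma)) / (1 + math.exp(-q))) - math.log(r_target)

gamma = -3.0      # hypothetical interaction energy, in k_B T
r_target = 0.5    # both hypothetical sequences start with this R
shift = 0.7       # size of the additive shift applied to P or to Q

# Two hypothetical sequences with different Q but identical R.
q_values = [-2.0, 1.0]
energies = [(p_for_target(r_target, q, gamma), q) for q in q_values]

r_before = [regulation_factor(p, q, gamma) for p, q in energies]
r_after_p_shift = [regulation_factor(p + shift, q, gamma) for p, q in energies]
r_after_q_shift = [regulation_factor(p, q + shift, gamma) for p, q in energies]

print(r_before)         # equal by construction
print(r_after_p_shift)  # still equal: the new R is a function of the old R alone
print(r_after_q_shift)  # unequal: the new R depends on Q, and hence on S
```

Shifting P rescales every R by the same factor exp(−shift), an invertible map of R that no SRM-type measurement can detect. Shifting Q sends sequences with equal R to different values, so that direction is constrained by data.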

In [10], measurements for ~ 5 × 10^4 mutant lac promoters were used to infer models for P and
Q individually as well as in the context of R. When P and Q were inferred individually, each was
determined only up to an unknown affine transformation. However, when P and Q were inferred
simultaneously by fitting R, three of the four diffeomorphic modes of P and Q vanished, leaving
only the additive mode shown in Eq. 26. Thus, inferring the nonlinear function R allowed the
binding energies of RNA polymerase and CRP to be determined in meaningful physical units (k_B T),
and the intracellular concentration of CRP, which manifests as an additive contribution to Q, to be
pinned down [10]. In fact, of the 204 independent parameters which defined the model in Eq. 23,
the only parameter which could not be pinned down by data was the single diffeomorphic mode in
Eq. 26, corresponding to changes in RNA polymerase concentration.



⁷ There are exceptions to this statement, e.g. if the various features F_i^μ(S) exhibit complicated
interdependencies, either because of their functional form or because stimuli S are restricted to a
particular subspace. We ignore such pathological cases here.

⁸ This assumes γ ≠ 0.
6 Discussion



Its inability to pin down diffeomorphic modes distinguishes mutual information from likelihood in
an important and revealing way. When maximizing likelihood with an assumed noise model, all
response model parameters are constrained by data. However, the constraints likelihood places on
diffeomorphic modes come entirely from the KL divergence (Eq. 4), which enforces the assumption
that the empirical noise function p(M|R) should match the assumed noise function π(M|R). The
relative likelihood of response models θ along diffeomorphic modes is exp[−N D(θ, π)], and so
the weight given to one's assumed noise function π grows with N. If there is any uncertainty
whatsoever about what the true noise function is, this term will become overly presumptuous when
N is sufficiently large.

A more rigorous approach is to place an explicit prior on possible noise functions π, and then
optimize the response model θ using the marginal likelihood in Eq. 6. This allows one's prior belief
about the noise function to influence the choice of response model when N is small, but the relative
influence of this prior diminishes as N becomes large. This "noise-function-averaged" or
"error-model-averaged" likelihood can be computed explicitly in certain cases and has proven useful on
real data [8]. However, in the large N limit the resulting inference procedure essentially amounts to
first identifying maximally informative θ, then using the prior on π to fix the diffeomorphic modes.

Regardless of the specific implementation, one's inference procedure should reflect the fact that 
SRM-type experiments are fundamentally insensitive to diffeomorphic modes of the response 
model. Any constraints along diffeomorphic modes must come from a source of information other 
than the SRM data itself, e.g. a separate calibration experiment. 

One might worry that a large number of response model parameters will be diffeomorphic, and
that SRM-type experiments will effectively require an assumed noise function if they are to yield
useful results. Such situations are conceivable, but in practice this is often not the case. When
the stimulus S is high-dimensional and the response R is low-dimensional, the vast majority of
model parameters will typically be involved in reducing the dimensionality of S; very few will only
parametrize diffeomorphisms of R. We showed that when θ is linear in its parameters, the number
of diffeomorphic modes will not exceed dim(R)[dim(R) + 1] (except in pathological cases). This
holds regardless of how large dim(S) is. In the specific linear-nonlinear model considered by [10]
(Eq. 23), only one of the 204 independent parameters turned out to be diffeomorphic. So although
diffeomorphic modes do appear in real-world applications, they are often very limited in number,
and in such cases the vast majority of response model parameters can be inferred from SRM-type
data without any systematic error stemming from an incorrect noise function.

Unfortunately, using mutual information as an objective function can present practical difficulties.
One must be able to rapidly and reliably estimate I(θ) from finite data, and the resulting I(θ) may
present a rugged optimization landscape. Still, various methods for estimating mutual information
have been implemented (e.g. [12, 25]), and the information optimization problem has been
successfully addressed in specific situations using stochastic gradient ascent [23, 24], standard Metropolis
Monte Carlo [8], and parallel tempering Monte Carlo [10, 14]. How best to address these practical
issues remains an open question, but we believe the exciting applications in neuroscience and
molecular biology provide compelling reasons to make progress on these problems.